PS: After relying on libraries for so long, I sometimes forget how the underlying algorithms are actually implemented; even when a paper makes the math clear, I cannot fluently turn it into code. So this is a review of some basic machine-learning terminology and concrete optimization algorithms.

Newton's method and quasi-Newton methods are also common approaches to unconstrained optimization, with the advantage of fast convergence. However, every Newton step requires the inverse of the Hessian of the objective, which is expensive to compute; quasi-Newton methods replace that inverse with an approximating positive-definite matrix, reducing the computational cost.

Newton's Method

The unconstrained optimization problem:
$$\min_{x \in R^{n}} f(x)$$

Its minimum point is denoted $x^{*}$.

Assume $f(x)$ has continuous second-order partial derivatives, and let $x^{(k)}$ be the $k$-th iterate. Then $f(x)$ admits a second-order Taylor expansion around $x^{(k)}$:

$$f(x) = f(x^{(k)}) + g_k^T (x - x^{(k)}) + \frac{1}{2} (x - x^{(k)})^T H(x^{(k)}) (x - x^{(k)})$$

where $g_k = g(x^{(k)}) = \nabla f(x^{(k)})$ is the gradient of $f(x)$ evaluated at $x^{(k)}$, and $H(x^{(k)})$ is the Hessian of $f(x)$,
$$H(x) = \left[\frac{\partial^{2} f}{\partial x_{i} \partial x_{j}}\right]_{n \times n}$$
evaluated at $x^{(k)}$.

A necessary condition for $f(x)$ to have an extremum is that the first derivative vanishes at the extremum point, i.e. the gradient vector is zero. In particular, when $H(x^{(k)})$ is positive definite, the extremum of $f(x)$ is a minimum (the analogue of a positive second derivative for a scalar function).

Suppose $x^{(k+1)}$ satisfies
$$\nabla f(x^{(k+1)}) = 0$$
Differentiating the second-order Taylor expansion gives
$$\nabla f(x) = g_k + H_k (x - x^{(k)})$$
where $H_k = H(x^{(k)})$. Then
$$g_k + H_k (x^{(k+1)} - x^{(k)}) = 0$$
$$x^{(k+1)} = x^{(k)} - H_k^{-1} g_k$$

Equivalently, defining $p_k$ by
$$H_k p_k = -g_k$$
the update is
$$x^{(k+1)} = x^{(k)} + p_k$$

The algorithm proceeds as follows.
Newton's method:
Input: objective function $f(x)$, gradient $g(x) = \nabla f(x)$, Hessian $H(x)$, tolerance $\varepsilon$
Output: a minimum point $x^{*}$ of $f(x)$

  1. Choose an initial point $x^{(0)}$ and set $k = 0$
  2. Compute $g_k = g(x^{(k)})$
  3. If $\|g_k\| < \varepsilon$, stop; the approximate solution is $x^{*} = x^{(k)}$
  4. Compute $H_k = H(x^{(k)})$ and solve for $p_k$:
    $H_k p_k = -g_k$
  5. $x^{(k+1)} = x^{(k)} + p_k$
  6. Set $k = k + 1$ and go to step 2.
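The steps above can be sketched in NumPy. The function name, signature, and the quadratic test objective below are illustrative choices, not from the original text; note that the basic iteration never evaluates the objective itself, only its gradient and Hessian.

```python
import numpy as np

def newton(grad, hess, x0, eps=1e-8, max_iter=100):
    # Hypothetical helper following the steps above; grad and hess are
    # callables returning the gradient vector and Hessian matrix.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                      # step 2: g_k = g(x^(k))
        if np.linalg.norm(g) < eps:      # step 3: stopping test
            break
        H = hess(x)                      # step 4: H_k = H(x^(k))
        p = np.linalg.solve(H, -g)       # solve H_k p_k = -g_k
        x = x + p                        # step 5: x^(k+1) = x^(k) + p_k
    return x

# Illustrative objective: f(x) = x1^2 + 2*x2^2, minimized at the origin.
# For a quadratic, a single Newton step lands exactly on the minimum.
x_star = newton(lambda x: np.array([2 * x[0], 4 * x[1]]),
                lambda x: np.array([[2.0, 0.0], [0.0, 4.0]]),
                [3.0, -2.0])
```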

Quasi-Newton Methods

Idea: use an $n \times n$ matrix $G_k$ in place of the inverse Hessian $H_k^{-1}$.
Taking $x = x^{(k+1)}$ in
$$\nabla f(x) = g_k + H_k (x - x^{(k)})$$
gives
$$g_{k+1} - g_k = H_k (x^{(k+1)} - x^{(k)})$$
Let $y_k = g_{k+1} - g_k$ and $\delta_k = x^{(k+1)} - x^{(k)}$; then
$$y_k = H_k \delta_k$$
or equivalently $H_k^{-1} y_k = \delta_k$, which is called the quasi-Newton condition.

If $H_k$ is positive definite (and hence $H_k^{-1}$ is too), the Newton search direction is guaranteed to be a descent direction. With the search direction $p_k = -H_k^{-1} g_k$, the update
$$x^{(k+1)} = x^{(k)} - H_k^{-1} g_k$$
generalizes to
$$x = x^{(k)} - \lambda H_k^{-1} g_k = x^{(k)} + \lambda p_k$$
The Taylor expansion of $f(x)$ around $x^{(k)}$ can then be approximated by
$$f(x) = f(x^{(k)}) - \lambda g_k^T H_k^{-1} g_k$$
Since $H_k^{-1}$ is positive definite, $g_k^T H_k^{-1} g_k > 0$. For a sufficiently small positive $\lambda$ we have $f(x) < f(x^{(k)})$, i.e. the search direction $p_k$ is a descent direction.

A quasi-Newton method treats $G_k$ as an approximation of $H_k^{-1}$; besides being positive definite, it must also satisfy the quasi-Newton condition above. Under that condition, each iteration can choose an updated matrix
$$G_{k+1} = G_k + \Delta G_k$$
Several concrete constructions follow.
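The quasi-Newton condition can be checked numerically: for a quadratic $f(x) = \frac{1}{2} x^T A x$ the Hessian is the constant matrix $A$, so $y_k = H \delta_k$ holds exactly between any two points. The matrix and points below are arbitrary illustrative choices.

```python
import numpy as np

# For f(x) = 1/2 x^T A x, the gradient is A x and the Hessian is A,
# so y_k = g_{k+1} - g_k = A (x^{k+1} - x^k) = A delta_k exactly.
A = np.array([[4.0, 1.0], [1.0, 3.0]])     # symmetric positive definite
grad = lambda x: A @ x

x_k, x_k1 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
delta_k = x_k1 - x_k
y_k = grad(x_k1) - grad(x_k)

assert np.allclose(y_k, A @ delta_k)                   # y_k = H delta_k
assert np.allclose(np.linalg.inv(A) @ y_k, delta_k)    # H^{-1} y_k = delta_k
```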

DFP Algorithm

The DFP algorithm chooses $G_k$ as the approximation of $H_k^{-1}$ and assumes each iterate $G_{k+1}$ is formed from $G_k$ plus two additional terms:
$$G_{k+1} = G_k + P_k + Q_k$$
where $P_k$ and $Q_k$ are matrices to be determined. Then
$$G_{k+1} y_k = G_k y_k + P_k y_k + Q_k y_k$$
For $G_{k+1}$ to satisfy the quasi-Newton condition, $P_k$ and $Q_k$ can be chosen so that
$$P_k y_k = \delta_k$$
$$Q_k y_k = -G_k y_k$$
One such choice is
$$P_k = \frac{\delta_k \delta_k^T}{\delta_k^T y_k}$$
$$Q_k = -\frac{G_k y_k y_k^T G_k}{y_k^T G_k y_k}$$
which yields the update formula for $G_{k+1}$:
$$G_{k+1} = G_k + \frac{\delta_k \delta_k^T}{\delta_k^T y_k} - \frac{G_k y_k y_k^T G_k}{y_k^T G_k y_k}$$
It can be shown that if the initial matrix $G_0$ is positive definite, then every $G_k$ produced by the iteration is positive definite.

Algorithm:
Input: objective function $f(x)$, gradient $g(x) = \nabla f(x)$, tolerance $\varepsilon$
Output: a minimum point $x^{*}$ of $f(x)$

  1. Choose an initial point $x^{(0)}$ and a positive-definite matrix $G_0$; set $k = 0$
  2. Compute $g_k = g(x^{(k)})$. If $\|g_k\| < \varepsilon$, stop; the approximate solution is $x^{*} = x^{(k)}$. Otherwise go to step 3.
  3. Set $p_k = -G_k g_k$
  4. One-dimensional search: find $\lambda_k$ such that
    $f(x^{(k)} + \lambda_k p_k) = \min_{\lambda \geq 0} f(x^{(k)} + \lambda p_k)$
  5. $x^{(k+1)} = x^{(k)} + \lambda_k p_k$
  6. Compute $g_{k+1} = g(x^{(k+1)})$. If $\|g_{k+1}\| < \varepsilon$, stop; the approximate solution is $x^{*} = x^{(k+1)}$. Otherwise compute
    $G_{k+1} = G_k + \frac{\delta_k \delta_k^T}{\delta_k^T y_k} - \frac{G_k y_k y_k^T G_k}{y_k^T G_k y_k}$
  7. Set $k = k + 1$ and go to step 3.
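The DFP steps above can be sketched as follows. As a simplification of the algorithm as stated, the exact one-dimensional search of step 4 is replaced by simple Armijo backtracking; the function names and the quadratic test objective are illustrative choices.

```python
import numpy as np

def dfp(f, grad, x0, eps=1e-8, max_iter=200):
    # Hypothetical sketch of the DFP iteration described above.
    x = np.asarray(x0, dtype=float)
    G = np.eye(len(x))                       # G_0: positive definite
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:          # stopping test (step 2/6)
            break
        p = -G @ g                           # step 3: p_k = -G_k g_k
        lam = 1.0                            # step 4: backtracking search
        while f(x + lam * p) > f(x) + 1e-4 * lam * (g @ p):
            lam *= 0.5
        x_new = x + lam * p                  # step 5
        g_new = grad(x_new)                  # step 6
        delta, y = x_new - x, g_new - g
        # DFP update: G + delta delta^T/(delta^T y) - G y y^T G/(y^T G y)
        G = G + np.outer(delta, delta) / (delta @ y) \
              - (G @ np.outer(y, y) @ G) / (y @ G @ y)
        x, g = x_new, g_new
    return x

# Illustrative quadratic objective, minimized at the origin.
x_star = dfp(lambda x: x[0]**2 + 2 * x[1]**2,
             lambda x: np.array([2 * x[0], 4 * x[1]]),
             [3.0, -2.0])
```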

BFGS Algorithm

The BFGS algorithm chooses $B_k$ to approximate the Hessian $H_k$; the corresponding quasi-Newton condition is
$$B_{k+1} \delta_k = y_k$$
Assume each iterate $B_{k+1}$ is formed from $B_k$ plus two additional terms:
$$B_{k+1} = B_k + P_k + Q_k$$
where $P_k$ and $Q_k$ are matrices to be determined. Then
$$B_{k+1} \delta_k = B_k \delta_k + P_k \delta_k + Q_k \delta_k$$
For $B_{k+1}$ to satisfy the quasi-Newton condition, $P_k$ and $Q_k$ can be chosen so that
$$P_k \delta_k = y_k$$
$$Q_k \delta_k = -B_k \delta_k$$
One such choice is
$$P_k = \frac{y_k y_k^T}{y_k^T \delta_k}$$
$$Q_k = -\frac{B_k \delta_k \delta_k^T B_k}{\delta_k^T B_k \delta_k}$$
which yields the update formula for $B_{k+1}$:
$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T \delta_k} - \frac{B_k \delta_k \delta_k^T B_k}{\delta_k^T B_k \delta_k}$$
It can be shown that if the initial matrix $B_0$ is positive definite, then every $B_k$ produced by the iteration is positive definite.

Algorithm:
Input: objective function $f(x)$, gradient $g(x) = \nabla f(x)$, tolerance $\varepsilon$
Output: a minimum point $x^{*}$ of $f(x)$

  1. Choose an initial point $x^{(0)}$ and a positive-definite matrix $B_0$; set $k = 0$
  2. Compute $g_k = g(x^{(k)})$. If $\|g_k\| < \varepsilon$, stop; the approximate solution is $x^{*} = x^{(k)}$. Otherwise go to step 3.
  3. Solve $B_k p_k = -g_k$ for $p_k$
  4. One-dimensional search: find $\lambda_k$ such that
    $f(x^{(k)} + \lambda_k p_k) = \min_{\lambda \geq 0} f(x^{(k)} + \lambda p_k)$
  5. $x^{(k+1)} = x^{(k)} + \lambda_k p_k$
  6. Compute $g_{k+1} = g(x^{(k+1)})$. If $\|g_{k+1}\| < \varepsilon$, stop; the approximate solution is $x^{*} = x^{(k+1)}$. Otherwise compute
    $B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T \delta_k} - \frac{B_k \delta_k \delta_k^T B_k}{\delta_k^T B_k \delta_k}$
  7. Set $k = k + 1$ and go to step 3.
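The BFGS steps can be sketched the same way; the only structural differences from the DFP sketch are that the iteration maintains $B_k$ (a Hessian approximation) and solves $B_k p_k = -g_k$ for the direction. The backtracking line search again stands in for the exact search of step 4, and the test objective is an illustrative choice.

```python
import numpy as np

def bfgs(f, grad, x0, eps=1e-8, max_iter=200):
    # Hypothetical sketch of the BFGS iteration described above.
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                       # B_0: positive definite
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:          # stopping test (step 2/6)
            break
        p = np.linalg.solve(B, -g)           # step 3: B_k p_k = -g_k
        lam = 1.0                            # step 4: backtracking search
        while f(x + lam * p) > f(x) + 1e-4 * lam * (g @ p):
            lam *= 0.5
        x_new = x + lam * p                  # step 5
        g_new = grad(x_new)                  # step 6
        delta, y = x_new - x, g_new - g
        # BFGS update: B + y y^T/(y^T delta) - B delta delta^T B/(delta^T B delta)
        B = B + np.outer(y, y) / (y @ delta) \
              - (B @ np.outer(delta, delta) @ B) / (delta @ B @ delta)
        x, g = x_new, g_new
    return x

# Illustrative quadratic objective, minimized at the origin.
x_star = bfgs(lambda x: x[0]**2 + 2 * x[1]**2,
              lambda x: np.array([2 * x[0], 4 * x[1]]),
              [3.0, -2.0])
```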

Broyden Class Algorithms

For the BFGS update formula, write $G_k = B_k^{-1}$ and $G_{k+1} = B_{k+1}^{-1}$.
Applying the Sherman-Morrison formula twice gives
$$G_{k+1} = \left(I - \frac{\delta_k y_k^T}{\delta_k^T y_k}\right) G_k \left(I - \frac{\delta_k y_k^T}{\delta_k^T y_k}\right)^T + \frac{\delta_k \delta_k^T}{\delta_k^T y_k}$$
which is the BFGS update expressed in terms of $G_k$.
Denote by $G^{DFP}$ the $G_{k+1}$ obtained from the DFP update of $G_k$, and by $G^{BFGS}$ the $G_{k+1}$ obtained from the BFGS update of $G_k$.
Since $G^{DFP}$ and $G^{BFGS}$ both satisfy the quasi-Newton condition,
so does their linear combination
$$G_{k+1} = \alpha G^{DFP} + (1 - \alpha) G^{BFGS}$$
which is moreover positive definite, where $0 \leq \alpha \leq 1$. This family of algorithms is called the Broyden class.
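This closing claim is easy to verify numerically: for any $0 \leq \alpha \leq 1$, the combined matrix still satisfies the quasi-Newton condition $G_{k+1} y_k = \delta_k$. The data below are arbitrary illustrative choices, constructed so that the curvature condition $\delta_k^T y_k > 0$ holds.

```python
import numpy as np

# Arbitrary data: y comes from a positive-definite "Hessian" H applied
# to delta, so delta^T y = delta^T H delta > 0.
rng = np.random.default_rng(0)
n = 4
G = np.eye(n)                                  # current approximation G_k
delta = rng.standard_normal(n)
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                    # positive definite
y = H @ delta

# DFP update of G_k
G_dfp = G + np.outer(delta, delta) / (delta @ y) \
          - (G @ np.outer(y, y) @ G) / (y @ G @ y)
# BFGS update of G_k (the Sherman-Morrison form above)
V = np.eye(n) - np.outer(delta, y) / (delta @ y)
G_bfgs = V @ G @ V.T + np.outer(delta, delta) / (delta @ y)

# Every convex combination satisfies G_{k+1} y = delta
for alpha in (0.0, 0.3, 1.0):
    G_next = alpha * G_dfp + (1 - alpha) * G_bfgs
    assert np.allclose(G_next @ y, delta)
```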